{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bcn0KDNdYTAi"
      },
      "source": [
        "<img src=\"https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-chdb.png\" height=100>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Wunb8L9tYTAk"
      },
      "source": [
        "Inspired by ClickHouse Blog: [ANN Vector Search with SQL-powered LSH & Random Projections](https://clickhouse.com/blog/approximate-nearest-neighbour-ann-with-sql-powered-local-sensitive-hashing-lsh-random-projections).\n",
        "\n",
        "This demo will show how to use chDB to make a simple search engine for a set of movies.\n",
        "```\n",
        "movieId,embedding\n",
        "318,\"[-0.32907996  3.2970035   0.15050603  1.9187577  -5.8646975  -3.7843416\n",
        " -2.6874192  -6.161338    1.98583    -2.6736846   2.1889842   5.162994\n",
        "  1.654852   -0.7761136   1.5172766  -0.85932654]\"\n",
        "296,\"[-0.01519391  2.443479   -1.480839    0.10609777 -5.6971617  -1.3988643\n",
        " -4.1634355  -6.399832    4.8691964  -2.7901962   1.738929    3.839515\n",
        "  1.5430368   1.4577994   0.56058794 -0.9734406 ]\"\n",
        "356,\"[-1.8876978   1.6772441  -1.9821857  -0.93794477 -2.5182424  -3.8408334\n",
        " -3.87617    -4.512172    0.8053944  -2.081389    1.454333    6.7315516\n",
        "  0.22428921  0.72071487  2.211912   -1.3959718 ]\"\n",
        "593,\"[-1.4681095   2.4807196  -2.990346    0.239727   -5.800576   -2.9217808\n",
        " -2.9491336  -6.646222    4.2070146  -2.650232    0.6342644   5.38617\n",
        "  1.0954435  -0.71700466  0.43723348 -0.8792468 ]\"\n",
        "2571,\"[-2.5742574   1.3490096  -2.0755954   3.0196552  -7.46083    -3.2669234\n",
        " -5.8962264  -4.022377    0.9717742   0.75643456  3.016018    4.7698874\n",
        " -0.34867725  3.7842882   0.4231439  -0.81689113]\"\n",
        "```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fd2f-qkrYTAk"
      },
      "source": [
        "# Recommendation systems these years\n",
        "\n",
        "The recommendation system has made several major advancements over the past 10 years:\n",
        "\n",
        "1. 2009-2015: LR (Logistic Regression) combined with sophisticated feature engineering defeated SVM and collaborative filtering, which were algorithms of the previous generation.\n",
        "1. 2012-2015: NN (Neural Networks) changed the CV (Computer Vision) and NLP (Natural Language Processing) industries, then returned to the recommendation system field, greatly reducing the importance of traditional skill in feature combination.\n",
        "1. 2013: Embedding was taken out from Google's archives and later developed into techniques like Item2vec, sparking a trend in mining user behavior.\n",
        "1. 2015-2016: Wide & Deep inspired \"grafting\" NN with various old models.\n",
        "1. 2016-2017: Experienced a strong counterattack from tree models such as XGBoost and LightGBM that were fast, good, and efficient.\n",
        "1. 2017: Transformer became popularized to the point where \"Attention Is All You Need.\"\n",
        "1. 2018-now: Mainly focused on deep exploration of features, especially user features. Representatively famous is DIEN."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CuW_ZcBlYTAl"
      },
      "source": [
        "# About this demo\n",
        "\n",
        "Item2vec technology is developed based on Word2vec. Its core idea is to treat the user's historical behavior sequence as a sentence, and then train the vector representation of each item through Word2vec. Finally, item recommendations are made based on the similarity of item vectors. The core of Item2vec technology is to treat the user's historical behavior sequence as a sentence, and then train the vector representation of each item through Word2vec. Finally, item recommendations are made based on the similarity of item vectors.\n",
        "\n",
        "The main purpose of this demo is to demonstrate how to train the vector representation of items using Word2vec and make item recommendations based on the similarity of item vectors. It mainly consists of 4 parts:\n",
        "1. Prepare item sequences based on user behavior.\n",
        "2. Train a CBOW model using the Word2Vec module of the gensim library.\n",
        "3. Extract all embedding data and write it to chDB.\n",
        "4. Perform queries on chDB based on cosine distance to find similar movies to the input movie."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yWBWVBEqYTAl"
      },
      "source": [
        "\n",
        "# Briefing about Word2Vec\n",
        "\n",
        "Word2Vec was introduced in two papers by a team of researchers at Google, published between September and October 2013. Alongside the papers, the researchers released their implementation in C. The Python implementation followed shortly after the first paper, courtesy of Gensim.\n",
        "\n",
        "The fundamental premise of Word2Vec is that words with similar contexts also have similar meanings and consequently share a comparable vector representation within the model. For example, \"dog,\" \"puppy,\" and \"pup\" are frequently used in analogous situations with similar surrounding words like \"good,\" \"fluffy,\" or \"cute.\" According to Word2Vec, they will thus possess a corresponding vector representation.\n",
        "\n",
        "Based on this assumption, Word2Vec can be utilized to discover relationships between words in a dataset, calculate their similarity, or employ the vector representation of these words as input for other applications such as text classification or clustering.\n",
        "\n",
        "<img src=\"https://mccormickml.com/assets/word2vec/skip_gram_net_arch.png\" alt=\"Word2Vec\" style=\"max-width:800px\">"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 43,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vj8qJpu1YTAl",
        "outputId": "b7cf5b31-012a-4519-903e-3bbe81dc3283"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Name: tensorflow\n",
            "Version: 2.15.0.post1\n",
            "Summary: TensorFlow is an open source machine learning framework for everyone.\n",
            "Home-page: https://www.tensorflow.org/\n",
            "Author: Google Inc.\n",
            "Author-email: packages@tensorflow.org\n",
            "License: Apache 2.0\n",
            "Location: /usr/local/lib/python3.10/dist-packages\n",
            "Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt\n",
            "Required-by: dopamine-rl\n",
            "---\n",
            "Name: chdb\n",
            "Version: 1.0.2\n",
            "Summary: chDB is an in-process SQL OLAP Engine powered by ClickHouse\n",
            "Home-page: https://github.com/auxten/chdb\n",
            "Author: auxten\n",
            "Author-email: auxtenwpc@gmail.com\n",
            "License: Apache-2.0\n",
            "Location: /usr/local/lib/python3.10/dist-packages\n",
            "Requires: \n",
            "Required-by: \n",
            "---\n",
            "Name: gensim\n",
            "Version: 4.3.2\n",
            "Summary: Python framework for fast Vector Space Modelling\n",
            "Home-page: https://radimrehurek.com/gensim/\n",
            "Author: Radim Rehurek\n",
            "Author-email: me@radimrehurek.com\n",
            "License: LGPL-2.1-only\n",
            "Location: /usr/local/lib/python3.10/dist-packages\n",
            "Requires: numpy, scipy, smart-open\n",
            "Required-by: \n",
            "---\n",
            "Name: numpy\n",
            "Version: 1.23.5\n",
            "Summary: NumPy is the fundamental package for array computing with Python.\n",
            "Home-page: https://www.numpy.org\n",
            "Author: Travis E. Oliphant et al.\n",
            "Author-email: \n",
            "License: BSD\n",
            "Location: /usr/local/lib/python3.10/dist-packages\n",
            "Requires: \n",
            "Required-by: albumentations, altair, arviz, astropy, autograd, blis, bokeh, bqplot, chex, cmdstanpy, contourpy, cufflinks, cupy-cuda11x, cvxpy, datascience, db-dtypes, dopamine-rl, ecos, flax, folium, geemap, gensim, gym, h5py, holoviews, hyperopt, ibis-framework, imageio, imbalanced-learn, imgaug, jax, jaxlib, librosa, lida, lightgbm, matplotlib, matplotlib-venn, missingno, mizani, ml-dtypes, mlxtend, moviepy, music21, nibabel, numba, numexpr, opencv-contrib-python, opencv-python, opencv-python-headless, opt-einsum, optax, orbax-checkpoint, osqp, pandas, pandas-gbq, patsy, plotnine, prophet, pyarrow, pycocotools, pyerfa, pymc, pytensor, python-louvain, PyWavelets, qdldl, qudida, scikit-image, scikit-learn, scipy, scs, seaborn, shapely, sklearn-pandas, soxr, spacy, stanio, statsmodels, tables, tensorboard, tensorflow, tensorflow-cpu, tensorflow-datasets, tensorflow-hub, tensorflow-probability, tensorstore, thinc, tifffile, torchtext, torchvision, transformers, wordcloud, xarray, xarray-einstats, xgboost, yellowbrick, yfinance\n"
          ]
        }
      ],
      "source": [
        "%pip install -q --upgrade tensorflow gensim chdb pandas pyarrow numpy==1.23.5 matplotlib\n",
        "%pip show tensorflow chdb gensim numpy"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "print(np.__version__)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qSBQH4HdbESN",
        "outputId": "a1cdda30-531d-4cd2-91a2-c613dfb8543b"
      },
      "execution_count": 44,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1.23.5\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 45,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kCyKXViGYTAn",
        "outputId": "68e326ad-3e9d-4ea3-87e7-bf12ee69df51"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "total 1129588\n",
            "-rw-r--r-- 1 root root 435164157 Dec 14 08:14 genome-scores.csv\n",
            "-rw-r--r-- 1 root root     18103 Dec 14 08:14 genome-tags.csv\n",
            "-rw-r--r-- 1 root root   1368578 Dec 14 08:14 links.csv\n",
            "-rw-r--r-- 1 root root   3038099 Dec 14 08:14 movies.csv\n",
            "-rw-r--r-- 1 root root 678260987 Dec 14 08:14 ratings.csv\n",
            "-rw-r--r-- 1 root root     10460 Dec 14 08:14 README.txt\n",
            "-rw-r--r-- 1 root root  38810332 Dec 14 08:14 tags.csv\n"
          ]
        }
      ],
      "source": [
        "import pandas as pd\n",
        "import zipfile\n",
        "import urllib.request\n",
        "import os\n",
        "import chdb\n",
        "from chdb import session\n",
        "\n",
        "# Download and extract the dataset\n",
        "if not os.path.exists(\"ml-25m/ratings.csv\"):\n",
        "    url = \"https://files.grouplens.org/datasets/movielens/ml-25m.zip\"\n",
        "    import ssl\n",
        "    ssl._create_default_https_context = ssl._create_unverified_context\n",
        "    filehandle, _ = urllib.request.urlretrieve(url)\n",
        "    zip_file_object = zipfile.ZipFile(filehandle, \"r\")\n",
        "    zip_file_object.extractall()\n",
        "\n",
        "!ls -l ml-25m"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 46,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "e5PvRQ82YTAn",
        "outputId": "fb4b3864-45c4-4cc9-ea08-90620c468920"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1,296,5,1147880044\n",
            "1,306,3.5,1147868817\n",
            "1,307,5,1147868828\n",
            "1,665,5,1147878820\n",
            "1,899,3.5,1147868510\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Peek at the data\n",
        "print(chdb.query(\"SELECT * FROM file('ml-25m/ratings.csv') LIMIT 5\"))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 47,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "H2kwfZR9YTAn",
        "outputId": "c821e7b3-673b-4404-8860-3b08222956d2"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\"movieId\",\"title\",\"genres\"\n",
            "1,\"Toy Story (1995)\",\"Adventure|Animation|Children|Comedy|Fantasy\"\n",
            "2,\"Jumanji (1995)\",\"Adventure|Children|Fantasy\"\n",
            "3,\"Grumpier Old Men (1995)\",\"Comedy|Romance\"\n",
            "4,\"Waiting to Exhale (1995)\",\"Comedy|Drama|Romance\"\n",
            "5,\"Father of the Bride Part II (1995)\",\"Comedy\"\n",
            "\n",
            "\"userId\",\"movieId\",\"rating\",\"timestamp\"\n",
            "1,296,5,1147880044\n",
            "1,306,3.5,1147868817\n",
            "1,307,5,1147868828\n",
            "1,665,5,1147878820\n",
            "1,899,3.5,1147868510\n",
            "\n",
            "\"userId\",\"movieId\",\"tag\",\"timestamp\"\n",
            "3,260,\"classic\",1439472355\n",
            "3,260,\"sci-fi\",1439472256\n",
            "4,1732,\"dark comedy\",1573943598\n",
            "4,1732,\"great dialogue\",1573943604\n",
            "4,7569,\"so bad it's good\",1573943455\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Create tables for the tables of movieLens dataset\n",
        "chs = session.Session()\n",
        "chs.query(\"CREATE DATABASE IF NOT EXISTS movielens ENGINE = Atomic\")\n",
        "chs.query(\"USE movielens\")\n",
        "chs.query(\n",
        "    \"CREATE VIEW movies AS SELECT movieId, title, genres FROM file('ml-25m/movies.csv')\"\n",
        ")\n",
        "chs.query(\n",
        "    \"CREATE VIEW ratings AS SELECT userId, movieId, rating, timestamp FROM file('ml-25m/ratings.csv')\"\n",
        ")\n",
        "chs.query(\n",
        "    \"CREATE VIEW tags AS SELECT userId, movieId, tag, timestamp FROM file('ml-25m/tags.csv')\"\n",
        ")\n",
        "print(chs.query(\"SELECT * FROM movies LIMIT 5\", \"CSVWithNames\"))\n",
        "print(chs.query(\"SELECT * FROM ratings LIMIT 5\", \"CSVWithNames\"))\n",
        "print(chs.query(\"SELECT * FROM tags LIMIT 5\", \"CSVWithNames\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MCyDHKiQYTAo"
      },
      "source": [
        "# Use word2vec to train the embeddings of movies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 48,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "P3IJRpuJYTAo",
        "outputId": "cfcb4b95-e781-4835-b026-5ca0aaef610a"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Length of movie list:  162343\n"
          ]
        }
      ],
      "source": [
        "# Generate the movie id sequence from user ratings, the movies that have been rated >3.5 by users group by userId\n",
        "# and concat with \" \", order by timestamp\n",
        "# The movie id sequence is used to generate the movie embedding,\n",
        "# ie. user 1 rated movie 233, 21, 11 and user 2 rated movie 33, 11, 21\n",
        "# then the movie id sequence is\n",
        "# \"233 21 11\"\n",
        "# \"33 11 21\"\n",
        "movie_id_seq = chs.query(\"\"\"SELECT arrayStringConcat(groupArray(movieId), ' ') FROM (\n",
        "                            SELECT userId, movieId FROM ratings WHERE rating > 3.5  ORDER BY userId, timestamp\n",
        "                            ) GROUP BY userId\"\"\")\n",
        "\n",
        "\n",
        "# Split the movie id sequence into list\n",
        "moive_list = str(movie_id_seq).split(\"\\n\")\n",
        "\n",
        "print(\"Length of movie list: \", len(moive_list))\n",
        "# print(\"First 3 movie list: \", moive_list[:3])\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 49,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4ZJuWviPYTAo",
        "outputId": "e133ed63-5875-4af2-a56a-346279fd70cf"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Length of movie id sequence list:  162343\n",
            "Vocabulary size:  40858\n",
            "Distinct movie id count:  40858\n",
            "\n",
            "Vocabulary content:  ['318', '296', '356', '593', '2571', '260', '527', '2959', '50', '1196', '858', '1198', '4993', '110', '2858', '589', '1210', '47', '1', '7153', '5952', '608', '480', '457', '2028', '1270', '2762', '58559', '4226', '32', '150', '3578', '79132', '1136', '1193', '1704', '1197', '1221', '541', '1214', '1291', '1089', '364', '1213', '4973', '1240', '293', '4306', '1036', '1265', '588', '2329', '590', '7361', '1200', '6539', '6874', '3147', '4886', '111', '4995', '6377', '1258', '1682', '1206', '912', '750', '33794', '1617', '778', '780', '2997', '1580', '1097', '924', '1208', '1527', '595', '60069', '48516', '380', '1732', '8961', '1222', '4963', '5418', '377', '2716', '2324', '4011', '4878', '1961', '5989', '5618', '7438', '3996', '2918', '592', '733', '68954']\n"
          ]
        }
      ],
      "source": [
        "import multiprocessing\n",
        "from gensim.models import Word2Vec\n",
        "\n",
        "cores = multiprocessing.cpu_count()\n",
        "\n",
        "# Split the movie id sequence into a list of lists\n",
        "movie_id_seq_list = [seq.strip(\"\\\"\").split() for seq in moive_list]\n",
        "print(\"Length of movie id sequence list: \", len(movie_id_seq_list))\n",
        "# print(\"First 5 movie id sequence list: \", movie_id_seq_list[:5])\n",
        "\n",
        "# Train the Word2Vec model using CBOW\n",
        "model = Word2Vec(sg=0, window=5, vector_size=16, min_count=1, workers=cores-1)\n",
        "model.build_vocab(movie_id_seq_list, progress_per=10000)\n",
        "print(\"Vocabulary size: \", len(model.wv))\n",
        "\n",
        "# Check the distinct movie id with at least one rating > 3.5 count\n",
        "print(\"Distinct movie id count: \", chs.query(\"SELECT count(DISTINCT movieId) FROM ratings WHERE rating > 3.5\"))\n",
        "\n",
        "model.train(movie_id_seq_list, total_examples=model.corpus_count, epochs=10, report_delay=1)\n",
        "\n",
        "# Print model info\n",
        "print(\"Vocabulary content: \", model.wv.index_to_key[:100])\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8W8VwhiJYTAp"
      },
      "source": [
        "# Test find similar movies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 50,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wADU4MzJYTAp",
        "outputId": "f8e2b1fe-cba9-4829-c07d-060f5e80fd1e"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Input movie:  \"Toy Story (1995)\"\n",
            "\n",
            "Top 10 similar movies: \n",
            "34,\"Babe (1995)\"\n",
            "150,\"Apollo 13 (1995)\"\n",
            "356,\"Forrest Gump (1994)\"\n",
            "364,\"Lion King, The (1994)\"\n",
            "588,\"Aladdin (1992)\"\n",
            "595,\"Beauty and the Beast (1991)\"\n",
            "1197,\"Princess Bride, The (1987)\"\n",
            "1265,\"Groundhog Day (1993)\"\n",
            "1270,\"Back to the Future (1985)\"\n",
            "3114,\"Toy Story 2 (1999)\"\n",
            "\n"
          ]
        }
      ],
      "source": [
        "input_movie_id = 1\n",
        "top_k = 10\n",
        "print(\"Input movie: \", chs.query(f\"SELECT title FROM movies WHERE movieId = {input_movie_id}\", \"CSV\"))\n",
        "print(\"Top 10 similar movies: \")\n",
        "similar_movies = model.wv.most_similar(str(input_movie_id), topn=top_k)\n",
        "print(chs.query(f\"SELECT movieId, title FROM movies WHERE movieId IN ({','.join([str(m[0]) for m in similar_movies])})\", \"CSV\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oMrrbaeEYTAp"
      },
      "source": [
        "# Save movieId and embeddings to a temporary CSV file"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 51,
      "metadata": {
        "id": "w94NYUhoYTAp"
      },
      "outputs": [],
      "source": [
        "import csv\n",
        "\n",
        "# Open the CSV file in write mode\n",
        "with open('movie_embeddings.csv', 'w', newline='') as file:\n",
        "    writer = csv.writer(file)\n",
        "\n",
        "    # Write the header row\n",
        "    writer.writerow(['movieId', 'embedding'])\n",
        "\n",
        "    # Iterate over each movieId and its corresponding embedding\n",
        "    for movieId in model.wv.index_to_key:\n",
        "        embedding = model.wv[movieId]\n",
        "        # Convert the format [0.1 0.2 ...] into a list of floats, eg. [0.1, 0.2, ...]\n",
        "        embedding = embedding.tolist()\n",
        "\n",
        "        # Write the movieId and embedding as a row in the CSV file\n",
        "        writer.writerow([movieId, embedding])\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d_qVqE_gYTAp"
      },
      "source": [
        "# Use brute force to find similar movies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 52,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tQ5xmpn6YTAq",
        "outputId": "dac28da1-78b0-47ba-92ae-90caa7696052"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "318,\"[-2.2265753746032715,3.5254011154174805,-0.8498407602310181,2.891636848449707,-5.75970458984375,-4.4655680656433105,-2.910050392150879,-4.874805927276611,1.8407117128372192,-4.037372589111328,0.5827102065086365,3.5602872371673584,2.98940110206604,0.626388669013977,-0.8868098855018616,-2.443618059158325]\"\n",
              "296,\"[-2.526492118835449,1.1677000522613525,-0.8071125149726868,1.050546407699585,-6.952147483825684,-3.644843339920044,-4.3457231521606445,-3.9279022216796875,3.377218008041382,-3.938249349594116,0.4486609697341919,1.5516788959503174,-0.06512846052646637,2.3503451347351074,-1.0555064678192139,-3.5413312911987305]\"\n",
              "356,\"[-3.7432684898376465,2.5020227432250977,-2.216348648071289,-0.7881428003311157,-2.4738593101501465,-3.0312352180480957,-4.331355571746826,-3.506458044052124,0.7487499117851257,-3.7879791259765625,0.40928640961647034,4.914693355560303,1.905363917350769,2.227639675140381,0.6767667531967163,-3.2117021083831787]\"\n",
              "593,\"[-4.163858890533447,2.1928892135620117,-3.1881637573242188,1.3978694677352905,-5.898205757141113,-3.79733943939209,-3.5469143390655518,-4.850025653839111,2.910221815109253,-4.476861476898193,-0.45984259247779846,2.1710827350616455,1.3505152463912964,1.466415286064148,-1.6167834997177124,-3.352323293685913]\"\n",
              "2571,\"[-4.516871452331543,2.630856513977051,-1.518278956413269,1.9730091094970703,-6.271072864532471,-1.944367527961731,-5.320414066314697,-4.20072078704834,-0.20229943096637726,-1.016119122505188,2.8113770484924316,2.3726844787597656,2.287644147872925,6.298654079437256,-1.2988191843032837,-0.8754432797431946]\""
            ]
          },
          "metadata": {},
          "execution_count": 52
        }
      ],
      "source": [
        "chs.query('SELECT * FROM file(\\'movie_embeddings.csv\\') LIMIT 5')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 53,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YV5G7Ye0YTAq",
        "outputId": "474d4c5c-02cf-4bad-b4fd-ebee980932a4"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Inserting movie embeddings into the database\n",
            "1,\"[-5.5897427,2.357738,-0.5006245,0.1455938,-2.2831314,-0.87840086,-2.5551517,-3.6584768,-0.10373427,-4.9113913,-0.30503443,5.5157366,-1.5590336,5.983279,4.385269,-3.1866481]\"\n",
            "2,\"[-5.338505,-1.5594655,-4.355125,0.069642186,1.8488991,0.33051878,-1.7176361,2.1713862,3.0727508,3.4173298,1.4632888,6.5680175,1.2017039,1.98483,4.0459323,-2.002859]\"\n",
            "3,\"[-3.5466137,1.669789,1.8515323,0.06255668,1.3897773,-4.7042356,3.6903667,-3.7350867,4.38805,5.6368246,3.0906188,2.5778446,-1.5468398,0.23956613,5.3413,-1.9792594]\"\n",
            "4,\"[-2.3924065,3.2029195,0.5825141,2.8336196,3.1721113,-1.7251801,4.581799,-1.233385,3.5758104,2.6230328,-0.72014546,1.3625188,-0.13874735,-5.3615384,3.1370401,-4.161427]\"\n",
            "5,\"[-1.5131618,1.6629226,0.3938448,-0.31881937,3.5624661,-3.0411289,3.7884533,-4.752323,5.4196496,2.502349,3.7858236,4.6285796,-1.576958,-0.91796184,6.0343866,-0.4933876]\"\n",
            "\n",
            "Movie Id, Title, Embeddings\n",
            "1,\"Toy Story (1995)\",\"[-5.5897427,2.357738,-0.5006245,0.1455938,-2.2831314,-0.87840086,-2.5551517,-3.6584768,-0.10373427,-4.9113913,-0.30503443,5.5157366,-1.5590336,5.983279,4.385269,-3.1866481]\"\n",
            "2,\"Jumanji (1995)\",\"[-5.338505,-1.5594655,-4.355125,0.069642186,1.8488991,0.33051878,-1.7176361,2.1713862,3.0727508,3.4173298,1.4632888,6.5680175,1.2017039,1.98483,4.0459323,-2.002859]\"\n",
            "3,\"Grumpier Old Men (1995)\",\"[-3.5466137,1.669789,1.8515323,0.06255668,1.3897773,-4.7042356,3.6903667,-3.7350867,4.38805,5.6368246,3.0906188,2.5778446,-1.5468398,0.23956613,5.3413,-1.9792594]\"\n",
            "4,\"Waiting to Exhale (1995)\",\"[-2.3924065,3.2029195,0.5825141,2.8336196,3.1721113,-1.7251801,4.581799,-1.233385,3.5758104,2.6230328,-0.72014546,1.3625188,-0.13874735,-5.3615384,3.1370401,-4.161427]\"\n",
            "5,\"Father of the Bride Part II (1995)\",\"[-1.5131618,1.6629226,0.3938448,-0.31881937,3.5624661,-3.0411289,3.7884533,-4.752323,5.4196496,2.502349,3.7858236,4.6285796,-1.576958,-0.91796184,6.0343866,-0.4933876]\"\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Switch to the movie_embeddings database\n",
        "chs.query(\"CREATE DATABASE IF NOT EXISTS movie_embeddings ENGINE = Atomic\")\n",
        "chs.query(\"USE movie_embeddings\")\n",
        "chs.query('DROP TABLE IF EXISTS embeddings')\n",
        "chs.query('DROP TABLE IF EXISTS embeddings_with_title')\n",
        "\n",
        "\n",
        "chs.query(\"\"\"CREATE TABLE embeddings (\n",
        "      movieId UInt32 NOT NULL,\n",
        "      embedding Array(Float32) NOT NULL\n",
        "  ) ENGINE = MergeTree()\n",
        "  ORDER BY movieId\"\"\")\n",
        "\n",
        "print(\"Inserting movie embeddings into the database\")\n",
        "chs.query(\"INSERT INTO embeddings FROM INFILE 'movie_embeddings.csv' FORMAT CSV\")\n",
        "print(chs.query('SELECT * FROM embeddings LIMIT 5'))\n",
        "\n",
        "# print(chs.query(\"SELCET * FROM movielens.movies LIMIT 5\"))\n",
        "\n",
        "# print(chs.query(\"\"\"SELECT e.movieId,\n",
        "#        m.title,\n",
        "#        e.embedding\n",
        "# FROM embeddings AS e\n",
        "# JOIN movielens.movies AS m ON e.movieId = m.movieId\n",
        "# LIMIT 5\"\"\"))\n",
        "\n",
        "# Join the embeddings table with the movies table to get the title\n",
        "chs.query(\"\"\"CREATE TABLE embeddings_with_title (\n",
        "        movieId UInt32 NOT NULL,\n",
        "        title String NOT NULL,\n",
        "        embedding Array(Float32) NOT NULL\n",
        "    ) ENGINE = MergeTree()\n",
        "ORDER BY movieId AS\n",
        "SELECT e.movieId,\n",
        "       m.title,\n",
        "       e.embedding\n",
        "FROM embeddings AS e\n",
        "JOIN movielens.movies AS m ON e.movieId = m.movieId\"\"\")\n",
        "\n",
        "print(\"Movie Id, Title, Embeddings\")\n",
        "print(chs.query('SELECT * FROM embeddings_with_title LIMIT 5'))\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "target_movieId = 318\n",
        "topN = chs.query(f\"\"\"\n",
        "          WITH\n",
        "            {target_movieId} AS theMovieId,\n",
        "            (SELECT embedding FROM embeddings_with_title WHERE movieId = theMovieId LIMIT 1) AS targetEmbedding\n",
        "          SELECT\n",
        "            movieId,\n",
        "            title,\n",
        "            cosineDistance(embedding, targetEmbedding) AS distance\n",
        "            FROM embeddings_with_title\n",
        "            WHERE movieId != theMovieId -- Not self\n",
        "            ORDER BY distance ASC\n",
        "            LIMIT 10\n",
        "          \"\"\", \"Pretty\")\n",
        "print(f\"Scaned {topN.rows_read()} rows, \"\n",
        "      f\"Top 10 similar movies to movieId {target_movieId} in {topN.elapsed()}\")\n",
        "print(\"Target Movie:\")\n",
        "print(chs.query(f\"SELECT * FROM movielens.movies WHERE movieId={target_movieId}\", \"Pretty\"))\n",
        "print(\"Top10 Similar:\")\n",
        "print(topN)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wvBnbe4Vn7O4",
        "outputId": "0146d7f7-5164-4776-8d5d-5e4f5fc3738a"
      },
      "execution_count": 54,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Scaned 10 rows, Top 10 similar movies to movieId 318 in 0.037433266\n",
            "Target Movie:\n",
            "┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
            "┃ \u001b[1mmovieId\u001b[0m ┃ \u001b[1mtitle                           \u001b[0m ┃ \u001b[1mgenres     \u001b[0m ┃\n",
            "┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
            "│     318 │ Shawshank Redemption, The (1994) │ Crime|Drama │\n",
            "└─────────┴──────────────────────────────────┴─────────────┘\n",
            "\n",
            "Top10 Similar:\n",
            "┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n",
            "┃ \u001b[1mmovieId\u001b[0m ┃ \u001b[1mtitle                           \u001b[0m ┃ \u001b[1m  distance\u001b[0m ┃\n",
            "┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n",
            "│     527 │ Schindler's List (1993)          │ 0.04840994 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│     593 │ Silence of the Lambs, The (1991) │ 0.06956363 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│      50 │ Usual Suspects, The (1995)       │ 0.10293013 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│     296 │ Pulp Fiction (1994)              │ 0.10800046 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│     356 │ Forrest Gump (1994)              │ 0.15133452 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│     858 │ Godfather, The (1972)            │ 0.15708566 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│     110 │ Braveheart (1995)                │ 0.17923981 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│    2028 │ Saving Private Ryan (1998)       │ 0.19311911 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│    2858 │ American Beauty (1999)           │ 0.20046866 │\n",
            "├─────────┼──────────────────────────────────┼────────────┤\n",
            "│    2959 │ Fight Club (1999)                │  0.2056213 │\n",
            "└─────────┴──────────────────────────────────┴────────────┘\n",
            "\n"
          ]
        }
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.2"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}